Introduction

This document will serve as a tutorial for using SCISSORS, with the added functionality of detailing exactly how the PBMC3k dataset figures presented in our manuscript were generated. We’ll start from a 10X counts matrix and end with fully annotated cell clusters. In addition to R, in order to run all the code in this document you’ll need a Python 3 installation with openTSNE and all its dependencies installed.

Libraries

R

library(pals)        # basic colors
library(dplyr)       # tidy data manipulation
library(Seurat)      # single cell infrastructure
library(ggplot2)     # plots
library(SCISSORS)    # our package 
library(paletteer)   # advanced colors
library(reticulate)  # Python interface
library(SeuratData)  # PBMC3k dataset

Python

import numpy as np
from openTSNE import TSNEEmbedding
from openTSNE import initialization
from openTSNE.affinity import Multiscale
from openTSNE.affinity import PerplexityBasedNN

Data

We load a scRNA-seq dataset provided by 10X Genomics that consists of 2,700 peripheral blood mononuclear cells (PBMCs) from a healthy donor.

pbmc <- LoadData("pbmc3k")

Preprocessing

Here we use PrepareData() to normalize expression & select highly variable genes through sctransform, run PCA & t-SNE. and cluster our cells. We utilize 15 principal components even though the Satija Lab vignette used 10, as we use sctransform normalization, which does a better job of retaining biological heterogeneity through normalization than standard log-normalization.

pbmc <- PrepareData(pbmc, 
                    n.variable.genes = 4000, 
                    n.PC = 15, 
                    which.dim.reduc = "tsne",
                    initial.resolution = .4, 
                    random.seed = 629)
## [1] "Normalizing counts using SCTransform"
## [1] "Running t-SNE on 15 principal components with perplexity = 30"
## [1] "Clustering cells in PCA space using k ~ 52 & resolution = 0.4"
## [1] "Found 6 unique clusters"
p0 <- DimPlot(pbmc, cols = tableau20()) + 
      labs(x = "t-SNE 1", y = "t-SNE 2") + 
      theme_yehlab() + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p0

Fit-SNE

We’ll also run the Fast Fourier Transform-accelerated version of t-SNE as implemented in the Python library openTSNE. First we’ll need to get our PC matrix into a form accessible by our Python interpreter.

pc_mat <- Embeddings(pbmc, reduction = "pca")
# import PC matrix
pc_mat = np.array(r.pc_mat)
# run Fit-SNE w/ multiscale kernel
init = initialization.pca(pc_mat, random_state=629)
affin_anneal = PerplexityBasedNN(pc_mat, perplexity=100, metric='cosine', random_state=629)
tsne = TSNEEmbedding(init, affin_anneal, negative_gradient_method='fft')
embed1 = tsne.optimize(n_iter=250, exaggeration=10, momentum=0.6)
embed2 = embed1.optimize(n_iter=750, exaggeration=1, momentum=0.8)
affin_anneal.set_perplexity(30)
embed3 = embed2.optimize(n_iter=15, exaggeration=2, momentum=0.8)
embed4 = embed3.optimize(n_iter=250)

Now we pull the results back into R, and save them in our Seurat object. We save the original embedding (made using the default Barnes-Hut t-SNE implementation) in pbmc@reduction$bh_tsne - we need to do this in order to make the Fit-SNE embedding the default embedding that will be retrieved in calls to functions such as DimPlot() or FeaturePlot().

embed <- as.matrix(py$embed4)
rownames(embed) <- colnames(pbmc)
colnames(embed) <- c("Fit-SNE_1", "Fit-SNE_2")
pbmc@reductions$bh_tsne <- pbmc@reductions$tsne
pbmc@reductions$tsne <- CreateDimReducObject(embeddings = embed, 
                                             key = "FitSNE_", 
                                             assay = "SCT", 
                                             global = TRUE)

Visualizing the results shows pretty much the same global structure as with the default t-SNE implementation, albeit rotated a bit, but I like that Fit-SNE’s clusters are a bit denser, so we’ll use Fit-SNE going forward.

p1 <- DimPlot(pbmc, cols = tableau20()) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p1

Reclustering

Here we’ll examine clusters 0, 1, and 4. Cluster 0 seems large enough, and has enough variability on the X-axis of the t-SNE embedding to appear as though it might harbor subgroups. Clusters 1 and 4 have small but visible subclusters.

reclust_res <- ReclusterCells(pbmc, 
                              which.clust = list(0, 1, 4), 
                              n.variable.genes = 4000,
                              n.PC = 15, 
                              k.vals = c(5, 10, 15), 
                              resolution.vals = c(.2, .3, .4), 
                              which.dim.reduc = "tsne", 
                              redo.embedding = TRUE, 
                              random.seed = 629)
DimPlot(reclust_res[[1]], reduction = "tsne")

DimPlot(reclust_res[[2]], reduction = "tsne")

FeaturePlot(reclust_res[[2]], reduction = "tsne", features = "HLA-DPB1")

DimPlot(reclust_res[[3]], reduction = "tsne")

FeaturePlot(reclust_res[[3]], reduction = "tsne", features = "CD14")

## [1] "Reclustering cells in cluster 0 using k = 15 & resolution = 0.3, which achieved silhouette score: 0.27"
## [1] "Reclustering cells in cluster 1 using k = 5 & resolution = 0.3, which achieved silhouette score: 0.27"
## [1] "Reclustering cells in cluster 4 using k = 5 & resolution = 0.3, which achieved silhouette score: 0.524"

Now we’ll run Fit-SNE on each of the reclustered objects for consistencies sake. First we’ll need to isolate the PC matrices and send them to Python.

pc_clust0 <- Embeddings(reclust_res[[1]], reduction = "pca")
pc_clust1 <- Embeddings(reclust_res[[2]], reduction = "pca")
pc_clust2 <- Embeddings(reclust_res[[3]], reduction = "pca")

Running Fit-SNE three times could probably be cleaned up and done in a for loop, but this was easiest to write out in a short time. We use different perplexity sets for each cluster, as they’re all composed of differing numbers of cells.

# import PC matrices
pc_clust0 = np.array(r.pc_clust0)
pc_clust1 = np.array(r.pc_clust1)
pc_clust2 = np.array(r.pc_clust2)

# run Fit-SNE w/ multiscale kernel - cluster 0
affin_multi_clust0 = Multiscale(pc_clust0, perplexities=[15, 50], metric='cosine', random_state=629)
init_clust0 = initialization.pca(pc_clust0, random_state=629)
tsne_obj_clust0 = TSNEEmbedding(init_clust0, affin_multi_clust0, negative_gradient_method='fft')
embed_clust0 = tsne_obj_clust0.optimize(n_iter=300, exaggeration=12, momentum=0.7)
embed_multi_clust0 = embed_clust0.optimize(n_iter=850, exaggeration=1, momentum=0.8)

# run Fit-SNE w/ multiscale kernel - cluster 1
affin_multi_clust1 = Multiscale(pc_clust1, perplexities=[15, 30], metric='cosine', random_state=629)
init_clust1 = initialization.pca(pc_clust1, random_state=629)
tsne_obj_clust1 = TSNEEmbedding(init_clust1, affin_multi_clust1, negative_gradient_method='fft')
embed_clust1 = tsne_obj_clust1.optimize(n_iter=300, exaggeration=12, momentum=0.6)
embed_multi_clust1 = embed_clust1.optimize(n_iter=850, exaggeration=1, momentum=0.8)

# run Fit-SNE w/ multiscale kernel - cluster 4
affin_multi_clust2 = Multiscale(pc_clust2, perplexities=[10, 30], metric='cosine', random_state=629)
init_clust2 = initialization.pca(pc_clust2, random_state=629)
tsne_obj_clust2 = TSNEEmbedding(init_clust2, affin_multi_clust2, negative_gradient_method='fft')
embed_clust2 = tsne_obj_clust2.optimize(n_iter=300, exaggeration=12, momentum=0.6)
embed_multi_clust2 = embed_clust2.optimize(n_iter=750, exaggeration=1, momentum=0.8)

Now we bring the results back in to R.

embed0 <- as.matrix(py$embed_multi_clust0)
rownames(embed0) <- colnames(reclust_res[[1]])
reclust_res[[1]]@reductions$bh_tsne <- reclust_res[[1]]@reductions$tsne
reclust_res[[1]]@reductions$fitsne <- CreateDimReducObject(embeddings = embed0, 
                                                           key = "FitSNE_",
                                                           assay = "SCT", 
                                                           global = TRUE)
embed1 <- as.matrix(py$embed_multi_clust1)
rownames(embed1) <- colnames(reclust_res[[2]])
reclust_res[[2]]@reductions$bh_tsne <- reclust_res[[2]]@reductions$tsne
reclust_res[[2]]@reductions$fitsne <- CreateDimReducObject(embeddings = embed1,
                                                           key = "FitSNE_",
                                                           assay = "SCT",
                                                           global = TRUE)
embed2 <- as.matrix(py$embed_multi_clust2)
rownames(embed2) <- colnames(reclust_res[[3]])
reclust_res[[3]]@reductions$bh_tsne <- reclust_res[[3]]@reductions$tsne
reclust_res[[3]]@reductions$fitsne <- CreateDimReducObject(embeddings = embed2,
                                                           key = "FitSNE_",
                                                           assay = "SCT",
                                                           global = TRUE)

Here’s what our reclusterings look like. There’s clear visual separation between the main clusters and the subgroups we’ve discovered using SCISSORS.

p2 <- DimPlot(reclust_res[[1]], cols = paletteer_d("miscpalettes::brightPastel")) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p3 <- DimPlot(reclust_res[[2]], cols = paletteer_d("miscpalettes::brightPastel")) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p4 <- DimPlot(reclust_res[[3]], cols = paletteer_d("miscpalettes::brightPastel")) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p2

p3

p4

Next, we’ll reintegrate our new clusters into our original Seurat object - this requires some finagling as Seurat is a bit weird with how it stores cell-level metadata. Since we had six clusters originally, and we discovered six new subclusters, we’ll end up with twelve total clusters.

clust_df <- data.frame(CellID = colnames(pbmc), 
                       ClustID = as.numeric(pbmc@meta.data$seurat_clusters) - 1, 
                       stringsAsFactors = FALSE)
clust_6 <- rownames(reclust_res[[1]]@meta.data[reclust_res[[1]]@meta.data$seurat_clusters == 1, ])
clust_7 <- rownames(reclust_res[[1]]@meta.data[reclust_res[[1]]@meta.data$seurat_clusters == 2, ])
clust_8 <- rownames(reclust_res[[2]]@meta.data[reclust_res[[2]]@meta.data$seurat_clusters == 1, ])
clust_9 <- rownames(reclust_res[[2]]@meta.data[reclust_res[[2]]@meta.data$seurat_clusters == 2, ])
clust_10 <- rownames(reclust_res[[3]]@meta.data[reclust_res[[3]]@meta.data$seurat_clusters == 1, ])
clust_11 <- rownames(reclust_res[[3]]@meta.data[reclust_res[[3]]@meta.data$seurat_clusters == 2, ])
label <- case_when(clust_df$CellID %in% clust_6 ~ 6, 
                   clust_df$CellID %in% clust_7 ~ 7, 
                   clust_df$CellID %in% clust_8 ~ 8, 
                   clust_df$CellID %in% clust_9 ~ 9, 
                   clust_df$CellID %in% clust_10 ~ 10, 
                   clust_df$CellID %in% clust_11 ~ 11, 
                   TRUE ~ clust_df$ClustID)
pbmc <- AddMetaData(pbmc, 
                    col.name = "seurat_clusters", 
                    metadata = as.factor(label))
Idents(pbmc) <- "seurat_clusters"
p5 <- DimPlot(pbmc, cols = tableau20()) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p5

Identify Cell Types

Now that we’ve determined our subpopulations, we can assign cell types to each cluster using the marker genes provided in the Satija Lab’s PBMC3k vignette, as well as other canonical markers.

CD4+ T Cells

We can quickly identify cluster 0 as the memory CD4+ T cells, and cluster 6 as the naive CD4+ population.

p6 <- FeaturePlot(pbmc, features = "IL7R") + 
      scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
      theme_yehlab() + 
      NoLegend() + 
      theme(axis.title = element_blank())
p7 <- FeaturePlot(pbmc, features = "CCR7") + 
      scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
      theme_yehlab()  + 
      NoLegend() + 
      theme(axis.title = element_blank())
p8 <- FeaturePlot(pbmc, features = "S100A4") + 
      scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
      theme_yehlab() + 
      NoLegend() + 
      theme(axis.title = element_blank())
(p6 | p7 | p8) / p5

Cluster 7 is only subtly separated from the CD4+ T cell clusters. We’ll use differential expression testing to determine if the cells in cluster 7 are a spurious cluster or a real T cell subtype. After running a Wilcox test we can see that several of the differentially expressed are associated with the interferon family of cytokines and with anti-viral immune responses.

FindAllMarkers(pbmc, 
               assay = "SCT",
               logfc.threshold = .5, 
               only.pos = TRUE, 
               test.use = "wilcox", 
               verbose = FALSE, 
               random.seed = 629) %>% 
  filter(cluster  == 7 & p_val_adj < .05) %>% 
  slice_head(n = 5)
p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
IFIT1 0 0.6684503 0.575 0.071 0 7 IFIT1
IFIT3 0 0.7114076 0.575 0.098 0 7 IFIT3
IFI61 0 1.3156295 0.950 0.374 0 7 IFI6
ISG151 0 1.3496338 0.950 0.421 0 7 ISG15
MX1 0 0.7809667 0.725 0.195 0 7 MX1

Upon visual inspection of the top three marker genes (as determined by adjusted p-value), we can see that they do an equally good job of distinguishing the small cluster from the sample as a whole as they do at separating it from the memory CD4+ T cells. Due to their anti-viral characteristics , we’ll define this group as being composed of Type 1 helper T cells (Th1).

p9 <- FeaturePlot(pbmc, features = "IFIT1") + 
      scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
      theme_yehlab() + 
      NoLegend() + 
      theme(axis.title = element_blank())
p10 <- FeaturePlot(pbmc, features = "IFIT3") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p11 <- FeaturePlot(pbmc, features = "IFI6") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
(p9 | p10 | p11) / p5

CD14+ Monocytes

Cluster 1 clearly houses our CD14+ monocytes.

p12 <- FeaturePlot(pbmc, features = "CD14") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p13 <- FeaturePlot(pbmc, features = "LYZ") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
(p12 | p13) / p5

FCGR3A+ Monocytes

The FCGR3A+ monocytes are positioned near the CD14+ monocytes in cluster 4. Note: these are also known as CD16+ monocytes.

p14 <- FeaturePlot(pbmc, features = "FCGR3A") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p15 <- FeaturePlot(pbmc, features = "MS4A7") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
(p14 | p15) / p5

Intermediate Monocytes

Lastly, it follows that between the CD14+ and FCGR3A+ monocytes are the intermediate monocytes, which are characterized by expression of CD14 and CD16, along with S100A8, CD74, HLA-DPB1, etc.

p16 <- FeaturePlot(pbmc, features = "HLA-DPB1") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p17 <- FeaturePlot(pbmc, features = "CD74") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p18 <- FeaturePlot(pbmc, features = "HLA-DRA") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
(p16 | p17 | p18) / p5

B Cells

Expression of MS4A1 allows us to isolate the B cells in cluster 3.

p19 <- FeaturePlot(pbmc, features = "MS4A1") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p19 / p5

CD8+ T Cells

The canonical marker CD8A swiftly identifies our CD8+ T cells in cluster 2.

p20 <- FeaturePlot(pbmc, features = "CD8A") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p20 / p5

Natural Killer Cells

We can use NKG7 and GNLY to isolate the NK cells in cluster 5.

p21 <- FeaturePlot(pbmc, features = "NKG7") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p22 <- FeaturePlot(pbmc, features = "GNLY") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
(p21 | p22) / p5

Dendritic Cells

The dendritic cell group is defined by expression of FCER1A and CST3 in cluster 9.

p23 <- FeaturePlot(pbmc, features = "FCER1A") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p24 <- FeaturePlot(pbmc, features = "CST3") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
(p23 | p24) / p5

Platelets

The tiny platelet population of 13 cells can be identified by its expression of PPBP in cluster 10.

p25 <- FeaturePlot(pbmc, features = "PPBP") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p25 / p5

Final Figure

Finally, we’ll add cell labels to our original Seurat object and plot the results.

pbmc$label <- case_when(pbmc$seurat_clusters == 0 ~ "Memory CD4+ T", 
                        pbmc$seurat_clusters == 1 ~ "CD14+ Monocyte", 
                        pbmc$seurat_clusters == 2 ~ "CD8+ T", 
                        pbmc$seurat_clusters == 3 ~ "B", 
                        pbmc$seurat_clusters == 4 ~ "FCGR3A+ Monocyte", 
                        pbmc$seurat_clusters == 5 ~ "NK", 
                        pbmc$seurat_clusters == 6 ~ "Naive CD4+ T", 
                        pbmc$seurat_clusters == 7 ~ "Th1", 
                        pbmc$seurat_clusters == 8 ~ "Intermediate Monocyte", 
                        pbmc$seurat_clusters == 9 ~ "DC", 
                        pbmc$seurat_clusters == 10 ~ "Platelet")

We visualize the final cells labels for our 11 clusters.

p26 <- DimPlot(pbmc, cols = paletteer_d("rcartocolor::Vivid"), group.by = "label") + 
       theme_yehlab() + 
       guides(color = guide_legend(nrow = 3, override.aes = list(size = 3))) + 
       labs(x = "Fit-SNE 1", y = "Fit-SNE 2", title = NULL)
p26

Conclusions

SCISSORS automatically split a large CD4+ T cluster into naive, memory, and Th1 CD4+ T cells. It successfully separated a small dendritic cell cluster from the CD14+ monocytes it had originally been grouped with, and teased out the intermediate monocyte population nestled between the CD14+ and CD16+ monocyte clusters. Lastly, it identified a tiny platelet cluster that had been erroneously grouped with the CD16+ monocytes. The intermediate monocyte and Th1 cells were not annotated in the original Satija Lab PBMC3k vignette.

We used the PBMC3k dataset because of 1) its immediate availability to anyone wishing to replicate our results and 2) the validity of its annotations, which allowed us to be confident in the results from SCISSORS, which was able to carve out rare cell groups from larger, broader cell types. In this case, the dendritic cell cluster was composed of 39 cells, the Th1 cluster of 40 cells, and the platelet cluster of just 13 cells. However, SCISSORS isn’t just useful for rare cell types: it also identified the intermediate monocytes, a cluster of 206 cells. We thus believe we can confidently say that SCISSORS has been shown to accurately and swiftly identify cell types, both common and rare, by considering the variance in gene expression within clusters and judging iterative reclustering using silhouette scores. We put forth that this approach is theoretically advantageous for identifying rare cell populations, rather than attempting to do so at the level of the entire dataset.

Save Data & Figures

This part isn’t really worth reading; it’s just here to prove that all the figures were actually dynamically generated and saved upon knitting this document.

We’ll create a quick convenience function to help us save the figures.

saveSCISSORS <- function(plot = NULL, 
                         name = NULL, 
                         border = TRUE, 
                         pub.ready = FALSE) {
  if (is.null(plot) | is.null(name)) stop("You forgot some arguments.")
  if (pub.ready) {
    dir <- "~/Desktop/R/SCISSORS/vignettes/figures_pub/PBMC"
    if (!border) {
      plot <- plot + 
              theme(panel.border = element_blank(), 
                    axis.title = element_blank(), 
                    legend.position = "none")
    } else {
      plot <- plot + 
              theme(axis.title = element_blank(), 
                    legend.position = "none")
    }
    ggsave(filename = paste0(name, ".pdf"), 
           device = "pdf", 
           units = "in",
           path = dir, 
           height = 8, 
           width = 8) 
  } else {
    dir <- "~/Desktop/R/SCISSORS/vignettes/figures_supp/PBMC"
    ggsave(filename = paste0(name, ".pdf"), 
           device = "pdf", 
           units = "in",
           path = dir, 
           height = 8, 
           width = 8) 
  }
}

Lastly, we’ll save the figures under ./vignettes/figures/.

saveSCISSORS(plot = p0, name = "Seurat_Clusters", pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p1, name = "Seurat_Clusters_FitSNE", pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p2, name = "Clust0_Reclust", pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p3, name = "Clust1_Reclust", pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p4, name = "Clust2_Reclust", pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p5, name = "SCISSORS_Clusters", pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p6, name = "CD4T_IL7R")
saveSCISSORS(plot = p7, name = "CD4T_CCR7")
saveSCISSORS(plot = p8, name = "CD4T_S100A4")
saveSCISSORS(plot = p9, name = "TH1_IFIT1")
saveSCISSORS(plot = p10, name = "TH1_IFIT3")
saveSCISSORS(plot = p11, name = "TH1_IFI6")
saveSCISSORS(plot = p12, name = "Monocyte_CD14")
saveSCISSORS(plot = p13, name = "Monocyte_LYZ")
saveSCISSORS(plot = p14, name = "FCGR3A_Monocyte_FCGR3A")
saveSCISSORS(plot = p15, name = "FCGR3A_Monocyte_MS4A7")
saveSCISSORS(plot = p16, name = "Intermediate_Mono_HLADPB1")
saveSCISSORS(plot = p17, name = "Intermediate_Mono_CD74")
saveSCISSORS(plot = p18, name = "Intermediate_Mono_HLADRA")
saveSCISSORS(plot = p19, name = "B_MS4A1")
saveSCISSORS(plot = p20, name = "CD8T_CD8A")
saveSCISSORS(plot = p21, name = "NK_NKG7")
saveSCISSORS(plot = p22, name = "NK_GNLY")
saveSCISSORS(plot = p23, name = "DC_FCER1A")
saveSCISSORS(plot = p24, name = "DC_CST3")
saveSCISSORS(plot = p25, name = "Platelet_PPBP")
saveSCISSORS(plot = p26, name = "SCISSORS_final_labels", pub.ready = TRUE, border = FALSE)

And of course:

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] pbmc3k.SeuratData_3.1.4     SeuratData_0.2.1           
##  [3] reticulate_1.18             paletteer_1.3.0            
##  [5] SCISSORS_0.0.2.0            SingleCellExperiment_1.12.0
##  [7] SummarizedExperiment_1.20.0 Biobase_2.50.0             
##  [9] GenomicRanges_1.42.0        GenomeInfoDb_1.26.2        
## [11] IRanges_2.24.1              S4Vectors_0.28.1           
## [13] BiocGenerics_0.36.0         MatrixGenerics_1.2.0       
## [15] matrixStats_0.57.0          data.table_1.13.6          
## [17] cluster_2.1.0               biomaRt_2.44.1             
## [19] ggplot2_3.3.3               Seurat_3.2.3               
## [21] dplyr_1.0.2                 pals_1.6                   
## 
## loaded via a namespace (and not attached):
##   [1] BiocFileCache_1.12.1   plyr_1.8.6             igraph_1.2.6          
##   [4] lazyeval_0.2.2         splines_4.0.3          listenv_0.8.0         
##   [7] scattermore_0.7        digest_0.6.27          htmltools_0.5.0       
##  [10] wesanderson_0.3.6.9000 fansi_0.4.1            magrittr_2.0.1        
##  [13] memoise_1.1.0          tensor_1.5             ROCR_1.0-11           
##  [16] limma_3.44.3           globals_0.14.0         askpass_1.1           
##  [19] prettyunits_1.1.1      colorspace_2.0-0       blob_1.2.1            
##  [22] rappdirs_0.3.1         ggrepel_0.9.0          xfun_0.20             
##  [25] prismatic_1.0.0        crayon_1.3.4           RCurl_1.98-1.2        
##  [28] jsonlite_1.7.2         spatstat_1.64-1        spatstat.data_1.7-0   
##  [31] survival_3.2-7         zoo_1.8-8              glue_1.4.2            
##  [34] polyclip_1.10-0        gtable_0.3.0           zlibbioc_1.36.0       
##  [37] XVector_0.30.0         leiden_0.3.6           DelayedArray_0.16.0   
##  [40] future.apply_1.7.0     maps_3.3.0             abind_1.4-5           
##  [43] scales_1.1.1           DBI_1.1.0              miniUI_0.1.1.1        
##  [46] Rcpp_1.0.5             viridisLite_0.3.0      xtable_1.8-4          
##  [49] progress_1.2.2         bit_4.0.4              rsvd_1.0.3            
##  [52] mapproj_1.2.7          htmlwidgets_1.5.3      httr_1.4.2            
##  [55] RColorBrewer_1.1-2     ellipsis_0.3.1         ica_1.0-2             
##  [58] farver_2.0.3           pkgconfig_2.0.3        XML_3.99-0.5          
##  [61] uwot_0.1.10            dbplyr_2.0.0           deldir_0.2-3          
##  [64] labeling_0.4.2         tidyselect_1.1.0       rlang_0.4.10          
##  [67] reshape2_1.4.4         later_1.1.0.1          AnnotationDbi_1.50.3  
##  [70] munsell_0.5.0          tools_4.0.3            cli_2.2.0             
##  [73] generics_0.1.0         RSQLite_2.2.2          ggridges_0.5.3        
##  [76] evaluate_0.14          stringr_1.4.0          fastmap_1.0.1         
##  [79] yaml_2.2.1             goftest_1.2-2          rematch2_2.1.2        
##  [82] knitr_1.30             bit64_4.0.5            fitdistrplus_1.1-3    
##  [85] purrr_0.3.4            RANN_2.6.1             pbapply_1.4-3         
##  [88] future_1.21.0          nlme_3.1-151           mime_0.9              
##  [91] rstudioapi_0.13        compiler_4.0.3         plotly_4.9.3          
##  [94] curl_4.3               png_0.1-7              spatstat.utils_1.20-2 
##  [97] tibble_3.0.4           stringi_1.5.3          highr_0.8             
## [100] lattice_0.20-41        Matrix_1.3-2           vctrs_0.3.6           
## [103] pillar_1.4.7           lifecycle_0.2.0        lmtest_0.9-38         
## [106] RcppAnnoy_0.0.18       cowplot_1.1.1          bitops_1.0-6          
## [109] irlba_2.3.3            httpuv_1.5.4           patchwork_1.1.1       
## [112] R6_2.5.0               promises_1.1.1         KernSmooth_2.23-18    
## [115] gridExtra_2.3          parallelly_1.23.0      codetools_0.2-18      
## [118] dichromat_2.0-0        MASS_7.3-53            assertthat_0.2.1      
## [121] openssl_1.4.3          withr_2.3.0            sctransform_0.3.2     
## [124] GenomeInfoDbData_1.2.4 mgcv_1.8-33            hms_0.5.3             
## [127] grid_4.0.3             rpart_4.1-15           tidyr_1.1.2           
## [130] rmarkdown_2.6          Rtsne_0.15             shiny_1.5.0
---
title: "PBMC Analysis using SCISSORS"
subtitle: "Jack Leary"
author: 
  - "University of North Carolina at Chapel Hill - Lineberger Comprehensive Cancer Center"
  - "University of Florida - Department of Biostatistics"
date: "`r Sys.Date()`"
output:
  html_document:
    theme: paper
    highlight: tango
    df_print: kable
    toc: true
    toc_float: true
    code_folding: show
    code_download: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, 
                      message = FALSE, 
                      warning = FALSE, 
                      fig.align = "center")
reticulate::use_virtualenv("~/Desktop/Python/science/venv/", required = TRUE)
```

# Introduction

This document will serve as a tutorial for using `SCISSORS`, with the added functionality of detailing exactly how the PBMC3k dataset figures presented in our manuscript were generated. We'll start from a 10X counts matrix and end with fully annotated cell clusters. In addition to R, in order to run all the code in this document you'll need a Python 3 installation with `openTSNE` and all its dependencies installed.

# Libraries

## R

```{r}
library(pals)        # basic colors
library(dplyr)       # tidy data manipulation
library(Seurat)      # single cell infrastructure
library(ggplot2)     # plots
library(SCISSORS)    # our package 
library(paletteer)   # advanced colors
library(reticulate)  # Python interface
library(SeuratData)  # PBMC3k dataset
```

## Python

```{python}
import numpy as np
from openTSNE import TSNEEmbedding
from openTSNE import initialization
from openTSNE.affinity import Multiscale
from openTSNE.affinity import PerplexityBasedNN
```

# Data

We load a scRNA-seq dataset provided by 10X Genomics that consists of 2,700 peripheral blood mononuclear cells (PBMCs) from a healthy donor.

```{r}
pbmc <- LoadData("pbmc3k")
```

# Preprocessing

Here we use `PrepareData()` to normalize expression & select highly variable genes through `sctransform`, run PCA & t-SNE. and cluster our cells. We utilize 15 principal components even though [the Satija Lab vignette]((https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html)) used 10, as we use `sctransform` normalization, which does a better job of retaining biological heterogeneity through normalization than standard log-normalization.

```{r, warning=FALSE, message=FALSE}
pbmc <- PrepareData(pbmc, 
                    n.variable.genes = 4000, 
                    n.PC = 15, 
                    which.dim.reduc = "tsne",
                    initial.resolution = .4, 
                    random.seed = 629)
p0 <- DimPlot(pbmc, cols = tableau20()) + 
      labs(x = "t-SNE 1", y = "t-SNE 2") + 
      theme_yehlab() + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p0
```

## Fit-SNE

We'll also run the Fast Fourier Transform-accelerated version of t-SNE as implemented in the Python library `openTSNE`. First we'll need to get our PC matrix into a form accessible by our Python interpreter.

```{r}
pc_mat <- Embeddings(pbmc, reduction = "pca")
```

```{python}
# import PC matrix
pc_mat = np.array(r.pc_mat)
# run Fit-SNE w/ multiscale kernel
init = initialization.pca(pc_mat, random_state=629)
affin_anneal = PerplexityBasedNN(pc_mat, perplexity=100, metric='cosine', random_state=629)
tsne = TSNEEmbedding(init, affin_anneal, negative_gradient_method='fft')
embed1 = tsne.optimize(n_iter=250, exaggeration=10, momentum=0.6)
embed2 = embed1.optimize(n_iter=750, exaggeration=1, momentum=0.8)
affin_anneal.set_perplexity(30)
embed3 = embed2.optimize(n_iter=15, exaggeration=2, momentum=0.8)
embed4 = embed3.optimize(n_iter=250)
```

Now we pull the results back into R, and save them in our `Seurat` object. We save the original embedding (made using the default Barnes-Hut t-SNE implementation) in `pbmc@reduction$bh_tsne` - we need to do this in order to make the Fit-SNE embedding the default embedding that will be retrieved in calls to functions such as `DimPlot()` or `FeaturePlot()`. 

```{r}
embed <- as.matrix(py$embed4)
rownames(embed) <- colnames(pbmc)
colnames(embed) <- c("Fit-SNE_1", "Fit-SNE_2")
pbmc@reductions$bh_tsne <- pbmc@reductions$tsne
pbmc@reductions$tsne <- CreateDimReducObject(embeddings = embed, 
                                             key = "FitSNE_", 
                                             assay = "SCT", 
                                             global = TRUE)
```

Visualizing the results shows pretty much the same global structure as with the default t-SNE implementation, albeit rotated a bit, but I like that Fit-SNE's clusters are a bit denser, so we'll use Fit-SNE going forward.

```{r}
p1 <- DimPlot(pbmc, cols = tableau20()) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p1
```

# Reclustering

Here we'll examine clusters 0, 1, and 4. Cluster 0 seems large enough, and has enough variability on the X-axis of the t-SNE embedding to appear as though it might harbor subgroups. Clusters 1 and 4 have small but visible subclusters.

```{r, warning=FALSE, message=FALSE, results='hold'}
reclust_res <- ReclusterCells(pbmc, 
                              which.clust = list(0, 1, 4), 
                              n.variable.genes = 4000,
                              n.PC = 15, 
                              k.vals = c(5, 10, 15), 
                              resolution.vals = c(.2, .3, .4), 
                              which.dim.reduc = "tsne", 
                              redo.embedding = TRUE, 
                              random.seed = 629)
DimPlot(reclust_res[[1]], reduction = "tsne")
DimPlot(reclust_res[[2]], reduction = "tsne")
FeaturePlot(reclust_res[[2]], reduction = "tsne", features = "HLA-DPB1")
DimPlot(reclust_res[[3]], reduction = "tsne")
FeaturePlot(reclust_res[[3]], reduction = "tsne", features = "CD14")
```

Now we'll run Fit-SNE on each of the reclustered objects for consistencies sake. First we'll need to isolate the PC matrices and send them to Python. 

```{r}
pc_clust0 <- Embeddings(reclust_res[[1]], reduction = "pca")
pc_clust1 <- Embeddings(reclust_res[[2]], reduction = "pca")
pc_clust2 <- Embeddings(reclust_res[[3]], reduction = "pca")
```

Running Fit-SNE three times could probably be cleaned up and done in a for loop, but this was easiest to write out in a short time. We use different perplexity sets for each cluster, as they're all composed of differing numbers of cells.

```{python}
# import PC matrices
pc_clust0 = np.array(r.pc_clust0)
pc_clust1 = np.array(r.pc_clust1)
pc_clust2 = np.array(r.pc_clust2)

# run Fit-SNE w/ multiscale kernel - cluster 0
affin_multi_clust0 = Multiscale(pc_clust0, perplexities=[15, 50], metric='cosine', random_state=629)
init_clust0 = initialization.pca(pc_clust0, random_state=629)
tsne_obj_clust0 = TSNEEmbedding(init_clust0, affin_multi_clust0, negative_gradient_method='fft')
embed_clust0 = tsne_obj_clust0.optimize(n_iter=300, exaggeration=12, momentum=0.7)
embed_multi_clust0 = embed_clust0.optimize(n_iter=850, exaggeration=1, momentum=0.8)

# run Fit-SNE w/ multiscale kernel - cluster 1
affin_multi_clust1 = Multiscale(pc_clust1, perplexities=[15, 30], metric='cosine', random_state=629)
init_clust1 = initialization.pca(pc_clust1, random_state=629)
tsne_obj_clust1 = TSNEEmbedding(init_clust1, affin_multi_clust1, negative_gradient_method='fft')
embed_clust1 = tsne_obj_clust1.optimize(n_iter=300, exaggeration=12, momentum=0.6)
embed_multi_clust1 = embed_clust1.optimize(n_iter=850, exaggeration=1, momentum=0.8)

# run Fit-SNE w/ multiscale kernel - cluster 4
affin_multi_clust2 = Multiscale(pc_clust2, perplexities=[10, 30], metric='cosine', random_state=629)
init_clust2 = initialization.pca(pc_clust2, random_state=629)
tsne_obj_clust2 = TSNEEmbedding(init_clust2, affin_multi_clust2, negative_gradient_method='fft')
embed_clust2 = tsne_obj_clust2.optimize(n_iter=300, exaggeration=12, momentum=0.6)
embed_multi_clust2 = embed_clust2.optimize(n_iter=750, exaggeration=1, momentum=0.8)
```

Now we bring the results back in to R.

```{r}
embed0 <- as.matrix(py$embed_multi_clust0)
rownames(embed0) <- colnames(reclust_res[[1]])
reclust_res[[1]]@reductions$bh_tsne <- reclust_res[[1]]@reductions$tsne
reclust_res[[1]]@reductions$fitsne <- CreateDimReducObject(embeddings = embed0, 
                                                           key = "FitSNE_",
                                                           assay = "SCT", 
                                                           global = TRUE)
embed1 <- as.matrix(py$embed_multi_clust1)
rownames(embed1) <- colnames(reclust_res[[2]])
reclust_res[[2]]@reductions$bh_tsne <- reclust_res[[2]]@reductions$tsne
reclust_res[[2]]@reductions$fitsne <- CreateDimReducObject(embeddings = embed1,
                                                           key = "FitSNE_",
                                                           assay = "SCT",
                                                           global = TRUE)
embed2 <- as.matrix(py$embed_multi_clust2)
rownames(embed2) <- colnames(reclust_res[[3]])
reclust_res[[3]]@reductions$bh_tsne <- reclust_res[[3]]@reductions$tsne
reclust_res[[3]]@reductions$fitsne <- CreateDimReducObject(embeddings = embed2,
                                                           key = "FitSNE_",
                                                           assay = "SCT",
                                                           global = TRUE)
```

Here's what our reclusterings look like. There's clear visual separation between the main clusters and the subgroups we've discovered using `SCISSORS`.

```{r}
p2 <- DimPlot(reclust_res[[1]], cols = paletteer_d("miscpalettes::brightPastel")) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p3 <- DimPlot(reclust_res[[2]], cols = paletteer_d("miscpalettes::brightPastel")) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p4 <- DimPlot(reclust_res[[3]], cols = paletteer_d("miscpalettes::brightPastel")) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p2
p3
p4
```

Next, we'll reintegrate our new clusters into our original `Seurat` object - this requires some finagling as `Seurat` is a bit weird with how it stores cell-level metadata. Since we had six clusters originally, and we discovered six new subclusters, we'll end up with twelve total clusters.

```{r, message=FALSE}
clust_df <- data.frame(CellID = colnames(pbmc), 
                       ClustID = as.numeric(pbmc@meta.data$seurat_clusters) - 1, 
                       stringsAsFactors = FALSE)
clust_6 <- rownames(reclust_res[[1]]@meta.data[reclust_res[[1]]@meta.data$seurat_clusters == 1, ])
clust_7 <- rownames(reclust_res[[1]]@meta.data[reclust_res[[1]]@meta.data$seurat_clusters == 2, ])
clust_8 <- rownames(reclust_res[[2]]@meta.data[reclust_res[[2]]@meta.data$seurat_clusters == 1, ])
clust_9 <- rownames(reclust_res[[2]]@meta.data[reclust_res[[2]]@meta.data$seurat_clusters == 2, ])
clust_10 <- rownames(reclust_res[[3]]@meta.data[reclust_res[[3]]@meta.data$seurat_clusters == 1, ])
clust_11 <- rownames(reclust_res[[3]]@meta.data[reclust_res[[3]]@meta.data$seurat_clusters == 2, ])
label <- case_when(clust_df$CellID %in% clust_6 ~ 6, 
                   clust_df$CellID %in% clust_7 ~ 7, 
                   clust_df$CellID %in% clust_8 ~ 8, 
                   clust_df$CellID %in% clust_9 ~ 9, 
                   clust_df$CellID %in% clust_10 ~ 10, 
                   clust_df$CellID %in% clust_11 ~ 11, 
                   TRUE ~ clust_df$ClustID)
pbmc <- AddMetaData(pbmc, 
                    col.name = "seurat_clusters", 
                    metadata = as.factor(label))
Idents(pbmc) <- "seurat_clusters"
p5 <- DimPlot(pbmc, cols = tableau20()) + 
      labs(x = "Fit-SNE 1", y = "Fit-SNE 2") + 
      theme_yehlab() + 
      guides(color = guide_legend(nrow = 1, override.aes = list(size = 3)))
p5
```

# Identify Cell Types

Now that we've determined our subpopulations, we can assign cell types to each cluster using the marker genes provided in [the Satija Lab's PBMC3k vignette](https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html), as well as other canonical markers.

## CD4+ T Cells

We can quickly identify cluster 0 as the memory CD4+ T cells, and cluster 6 as the naive CD4+ population.

```{r}
p6 <- FeaturePlot(pbmc, features = "IL7R") + 
      scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
      theme_yehlab() + 
      NoLegend() + 
      theme(axis.title = element_blank())
p7 <- FeaturePlot(pbmc, features = "CCR7") + 
      scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
      theme_yehlab()  + 
      NoLegend() + 
      theme(axis.title = element_blank())
p8 <- FeaturePlot(pbmc, features = "S100A4") + 
      scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
      theme_yehlab() + 
      NoLegend() + 
      theme(axis.title = element_blank())
(p6 | p7 | p8) / p5
```

Cluster 7 is only subtly separated from the CD4+ T cell clusters. We'll use differential expression testing to determine if the cells in cluster 7 are a spurious cluster or a real T cell subtype. After running a Wilcox test we can see that several of the differentially expressed are associated with the interferon family of cytokines and with anti-viral immune responses.

```{r}
FindAllMarkers(pbmc, 
               assay = "SCT",
               logfc.threshold = .5, 
               only.pos = TRUE, 
               test.use = "wilcox", 
               verbose = FALSE, 
               random.seed = 629) %>% 
  filter(cluster  == 7 & p_val_adj < .05) %>% 
  slice_head(n = 5)
```

Upon visual inspection of the top three marker genes (as determined by adjusted p-value), we can see that they do an equally good job of distinguishing the small cluster from the sample as a whole as they do at separating it from the memory CD4+ T cells. Due to their anti-viral characteristics , we'll define this group as being composed of Type 1 helper T cells (Th1).

```{r}
p9 <- FeaturePlot(pbmc, features = "IFIT1") + 
      scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
      theme_yehlab() + 
      NoLegend() + 
      theme(axis.title = element_blank())
p10 <- FeaturePlot(pbmc, features = "IFIT3") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p11 <- FeaturePlot(pbmc, features = "IFI6") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
(p9 | p10 | p11) / p5
```

## CD14+ Monocytes

Cluster 1 clearly houses our CD14+ monocytes.

```{r}
p12 <- FeaturePlot(pbmc, features = "CD14") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p13 <- FeaturePlot(pbmc, features = "LYZ") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
(p12 | p13) / p5
```

## FCGR3A+ Monocytes

The FCGR3A+ monocytes are positioned near the CD14+ monocytes in cluster 4. Note: these are also known as CD16+ monocytes. 

```{r}
p14 <- FeaturePlot(pbmc, features = "FCGR3A") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p15 <- FeaturePlot(pbmc, features = "MS4A7") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
(p14 | p15) / p5
```

## Intermediate Monocytes

Lastly, it follows that between the CD14+ and FCGR3A+ monocytes are the intermediate monocytes, which [are characterized by](https://www.frontiersin.org/articles/10.3389/fimmu.2019.02035/full#B3) expression of CD14 and CD16, along with S100A8, CD74, HLA-DPB1, etc. 

```{r}
p16 <- FeaturePlot(pbmc, features = "HLA-DPB1") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p17 <- FeaturePlot(pbmc, features = "CD74") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p18 <- FeaturePlot(pbmc, features = "HLA-DRA") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
(p16 | p17 | p18) / p5
```


## B Cells

Expression of MS4A1 allows us to isolate the B cells in cluster 3.

```{r}
p19 <- FeaturePlot(pbmc, features = "MS4A1") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p19 / p5
```

## CD8+ T Cells

The canonical marker CD8A swiftly identifies our CD8+ T cells in cluster 2.

```{r}
p20 <- FeaturePlot(pbmc, features = "CD8A") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p20 / p5
```

## Natural Killer Cells

We can use NKG7 and GNLY to isolate the NK cells in cluster 5.

```{r}
p21 <- FeaturePlot(pbmc, features = "NKG7") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p22 <- FeaturePlot(pbmc, features = "GNLY") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
(p21 | p22) / p5
```

## Dendritic Cells

The dendritic cell group is defined by expression of FCER1A and CST3 in cluster 9.

```{r}
p23 <- FeaturePlot(pbmc, features = "FCER1A") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p24 <- FeaturePlot(pbmc, features = "CST3") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
(p23 | p24) / p5
```

## Platelets

The tiny platelet population of `r sum(pbmc$seurat_clusters == 10)` cells can be identified by its expression of PPBP in cluster 10.

```{r}
p25 <- FeaturePlot(pbmc, features = "PPBP") + 
       scale_color_gradientn(colors = wesanderson::wes_palette("Zissou1")) + 
       theme_yehlab() + 
       NoLegend() + 
       theme(axis.title = element_blank())
p25 / p5
```

## Final Figure

Finally, we'll add cell labels to our original `Seurat` object and plot the results.

```{r}
pbmc$label <- case_when(pbmc$seurat_clusters == 0 ~ "Memory CD4+ T", 
                        pbmc$seurat_clusters == 1 ~ "CD14+ Monocyte", 
                        pbmc$seurat_clusters == 2 ~ "CD8+ T", 
                        pbmc$seurat_clusters == 3 ~ "B", 
                        pbmc$seurat_clusters == 4 ~ "FCGR3A+ Monocyte", 
                        pbmc$seurat_clusters == 5 ~ "NK", 
                        pbmc$seurat_clusters == 6 ~ "Naive CD4+ T", 
                        pbmc$seurat_clusters == 7 ~ "Th1", 
                        pbmc$seurat_clusters == 8 ~ "Intermediate Monocyte", 
                        pbmc$seurat_clusters == 9 ~ "DC", 
                        pbmc$seurat_clusters == 10 ~ "Platelet")
```

We visualize the final cells labels for our 11 clusters.  

```{r}
p26 <- DimPlot(pbmc, cols = paletteer_d("rcartocolor::Vivid"), group.by = "label") + 
       theme_yehlab() + 
       guides(color = guide_legend(nrow = 3, override.aes = list(size = 3))) + 
       labs(x = "Fit-SNE 1", y = "Fit-SNE 2", title = NULL)
p26
```

# Conclusions

SCISSORS automatically split a large CD4+ T cluster into naive, memory, and Th1 CD4+ T cells. It successfully separated a small dendritic cell cluster from the CD14+ monocytes it had originally been grouped with, and teased out the intermediate monocyte population nestled between the CD14+ and CD16+ monocyte clusters. Lastly, it identified a tiny platelet cluster that had been erroneously grouped with the CD16+ monocytes. The intermediate monocyte and Th1 cells were not annotated in [the original Satija Lab PBMC3k vignette](https://satijalab.org/seurat/v3.2/pbmc3k_tutorial.html). 

We used the PBMC3k dataset because of 1) its immediate availability to anyone wishing to replicate our results and 2) the validity of its annotations, which allowed us to be confident in the results from SCISSORS, which was able to carve out rare cell groups from larger, broader cell types. In this case, the dendritic cell cluster was composed of 39 cells, the Th1 cluster of 40 cells, and the platelet cluster of just 13 cells. However, SCISSORS isn't just useful for rare cell types: it also identified the intermediate monocytes, a cluster of 206 cells. We thus believe we can confidently say that SCISSORS has been shown to accurately and swiftly identify cell types, both common and rare, by considering the variance in gene expression within clusters and judging iterative reclustering using silhouette scores. We put forth that this approach is theoretically advantageous for identifying rare cell populations, rather than attempting to do so at the level of the entire dataset.

# Save Data & Figures

This part isn't really worth reading; it's just here to prove that all the figures were actually dynamically generated and saved upon knitting this document.

We'll create a quick convenience function to help us save the figures.

```{r}
saveSCISSORS <- function(plot = NULL, 
                         name = NULL, 
                         border = TRUE, 
                         pub.ready = FALSE) {
  if (is.null(plot) | is.null(name)) stop("You forgot some arguments.")
  if (pub.ready) {
    dir <- "~/Desktop/R/SCISSORS/vignettes/figures_pub/PBMC"
    if (!border) {
      plot <- plot + 
              theme(panel.border = element_blank(), 
                    axis.title = element_blank(), 
                    legend.position = "none")
    } else {
      plot <- plot + 
              theme(axis.title = element_blank(), 
                    legend.position = "none")
    }
    ggsave(filename = paste0(name, ".pdf"), 
           device = "pdf", 
           units = "in",
           path = dir, 
           height = 8, 
           width = 8) 
  } else {
    dir <- "~/Desktop/R/SCISSORS/vignettes/figures_supp/PBMC"
    ggsave(filename = paste0(name, ".pdf"), 
           device = "pdf", 
           units = "in",
           path = dir, 
           height = 8, 
           width = 8) 
  }
}
```

Lastly, we'll save the figures under `./vignettes/figures/`. 

```{r}
saveSCISSORS(plot = p0, name = "Seurat_Clusters", pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p1, name = "Seurat_Clusters_FitSNE", pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p2, name = "Clust0_Reclust", pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p3, name = "Clust1_Reclust", pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p4, name = "Clust2_Reclust", pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p5, name = "SCISSORS_Clusters", pub.ready = TRUE, border = FALSE)
saveSCISSORS(plot = p6, name = "CD4T_IL7R")
saveSCISSORS(plot = p7, name = "CD4T_CCR7")
saveSCISSORS(plot = p8, name = "CD4T_S100A4")
saveSCISSORS(plot = p9, name = "TH1_IFIT1")
saveSCISSORS(plot = p10, name = "TH1_IFIT3")
saveSCISSORS(plot = p11, name = "TH1_IFI6")
saveSCISSORS(plot = p12, name = "Monocyte_CD14")
saveSCISSORS(plot = p13, name = "Monocyte_LYZ")
saveSCISSORS(plot = p14, name = "FCGR3A_Monocyte_FCGR3A")
saveSCISSORS(plot = p15, name = "FCGR3A_Monocyte_MS4A7")
saveSCISSORS(plot = p16, name = "Intermediate_Mono_HLADPB1")
saveSCISSORS(plot = p17, name = "Intermediate_Mono_CD74")
saveSCISSORS(plot = p18, name = "Intermediate_Mono_HLADRA")
saveSCISSORS(plot = p19, name = "B_MS4A1")
saveSCISSORS(plot = p20, name = "CD8T_CD8A")
saveSCISSORS(plot = p21, name = "NK_NKG7")
saveSCISSORS(plot = p22, name = "NK_GNLY")
saveSCISSORS(plot = p23, name = "DC_FCER1A")
saveSCISSORS(plot = p24, name = "DC_CST3")
saveSCISSORS(plot = p25, name = "Platelet_PPBP")
saveSCISSORS(plot = p26, name = "SCISSORS_final_labels", pub.ready = TRUE, border = FALSE)
```

And of course:

```{r}
sessionInfo()
```
